A Polytomous Scoring Approach Based on Item Response Time

psychometrics
response time
IRT

In this post, we introduce an easy and practical way to deal with not-reached items in low-stakes assessments. First, we describe a polytomous scoring approach for handling not-reached items in computerized low-stakes assessments, and then we demonstrate how to implement it in R. This approach makes no explicit assumption about the association between not-reached items and student ability; instead, it considers only optimal time use, and hence engagement and effortful response behavior, when dealing with not-reached items.

(10 min read)

Okan Bulut (http://www.okanbulut.com/), University of Alberta (https://www.ualberta.ca) and Guher Gorgun, University of Alberta (https://www.ualberta.ca)
02-09-2021
Photo by Veri Ivanova on Unsplash

Introduction

Low-stakes assessments (e.g., formative assessments and progress monitoring measures in K-12) usually have no direct consequences for students. Therefore, some students may not show effortful response behavior when attempting the items on such assessments and may leave some items unanswered. These items are typically referred to as not-reached items. For example, some students may try to answer all of the items rapidly and complete the assessment in an unrealistically short amount of time. Conversely, some students may spend an unrealistically long amount of time on each item and fail to answer all of the items within the allotted time. Furthermore, students may leave items unanswered due to test speededness, the situation in which the allotted time does not allow a large number of students to fully consider all items on the assessment (Lu and Sireci 2007).

In practice, not-reached items are often treated as either incorrect or not-administered (i.e., NA) in the estimation of item and person parameters. However, when the proportion of not-reached items is high, these approaches may yield biased parameter estimates and thereby threaten the validity of assessment results. To date, researchers have proposed various model-based approaches to deal with not-reached items, such as modeling valid responses and not-reached items jointly in a tree-based item response theory (IRT) model (e.g., Debeer, Janssen, and De Boeck (2017)) or modeling proficiency and the tendency to omit items as distinct latent traits (e.g., Pohl, Gräfe, and Rose (2014)). However, these models are typically complex and would not be easy to use in operational settings.

Response time spent on each item in an assessment is often considered a strong proxy for students’ engagement with the items (e.g., Kuhfeld and Soland (2020), Pohl, Ulitzsch, and von Davier (2019), Rios et al. (2017)). To date, several researchers have demonstrated the utility of response times in reducing the effects of non-effortful response behavior, such as rapid guessing (e.g., Kuhfeld and Soland (2020), Pohl, Ulitzsch, and von Davier (2019), Wise and Kong (2005)). By identifying and removing responses where rapid guessing occurred, the accuracy of item and person parameter estimates can be improved without applying a complex modeling approach.

In this post, we will demonstrate an alternative method that considers not only students with rapid guessing behavior but also students who spend too much time on each item and thereby leave many items unanswered. In the following sections, we will briefly describe how our approach works and then demonstrate its use in R.

Polytomous Scoring

In our recent study (Gorgun and Bulut 2021), we proposed a new scoring approach that utilizes response times to transform dichotomous responses into polytomous responses. With this approach, students can receive partial credit on their responses depending on how accurately and how rapidly they answer the items. The approach combines speed and accuracy in the scoring process to alleviate the negative impact of not-reached items on the estimation of item and person parameters.

To conceptualize our scoring approach, we introduce the concept of optimal time, which refers to spending a reasonable amount of time when responding to an item. Optimal time allows us to distinguish students who spend optimal time but miss the item from students who spend too much time on the item and yet answer it incorrectly. Using response time, we group students into three categories:

  1. Optimal time users who answer the item within the optimal time interval,
  2. Rapid guessers who answer the item in an unrealistically short amount of time, and
  3. Slow respondents who answer the item in an unrealistically long amount of time.

If an assessment is timed, students are expected to adjust their speed to attempt as many items as possible within the allotted time. Therefore, spending too little time (rapid guessing) or too much time (slow responding) on a given item can be considered an outcome of disengaged response behavior. Our scoring approach assigns partial credit to optimal time users who answer the item incorrectly but spend optimal time when attempting it. Thus, it allows for a more fine-grained analysis of response behavior.
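As a quick sketch, the three groups can be identified in R with cut(). The response times and the median below are made up for illustration, with cut-offs at 25% and 175% of the median (the values used later in this post):

```r
# Hypothetical response times (seconds) for one item, and a hypothetical
# median response time of 20 seconds
times <- c(2, 15, 80)
med   <- 20

# Classify each respondent using cut-offs at 25% and 175% of the median
group <- cut(times,
             breaks = c(-Inf, 0.25 * med, 1.75 * med, Inf),
             labels = c("rapid guesser", "optimal time user", "slow respondent"))
as.character(group)  # "rapid guesser" "optimal time user" "slow respondent"
```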

How Does It Work?

The polytomous scoring approach can be applied using the following steps:

  1. We separate response times for correct and incorrect responses and then find two median response times for each item: one for correct responses and one for incorrect responses. The median is used because it is robust to outliers in the response time distribution.

  2. We use the normative threshold (NT) approach introduced by Wise and Kong (2005). This step gives us two cut-off values that divide each response time distribution into three regions: rapid guessers, optimal time users, and slow respondents. For example, we can use 25% and 175% of the median response time as the boundaries of the optimal time interval¹.

  3. After finding the cut-off points for the response time distributions of each item, we select a scoring range: either 0 to 3 points or 0 to 4 points.

    • If the scoring range is 0 to 4:
      • Based on the correct response time distribution, we assign 4 points to the fastest students, 3 points to the middle region (i.e., optimal time users), and 2 points to the slowest students.
      • Based on the incorrect response time distribution, we assign 0 points to rapid guessers and slow respondents, and 1 point to the optimal time users.
    • If the scoring range is 0 to 3, the same scoring rule applies to students with incorrect responses. For the correct response time distribution, however, 2 points are given to both rapid guessers and slow respondents, and 3 points are assigned to optimal time users in the middle region of the distribution.
  4. We determine how to deal with not-reached items: they can be treated as either not-administered (i.e., missing) or incorrect.
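To make the scoring rules concrete, here is a toy sketch in R for the correct response time distribution of a single item, using made-up response times and the 25%/175% thresholds:

```r
# Hypothetical response times (seconds) for students who answered one item correctly
correct_times <- c(4, 12, 20, 25, 30, 95)

med   <- median(correct_times)  # 22.5
lower <- 0.25 * med             # 5.625
upper <- 1.75 * med             # 39.375

# 0-3 range: optimal time users get 3; rapid guessers and slow respondents get 2
score3 <- ifelse(correct_times < lower | correct_times > upper, 2, 3)

# 0-4 range: rapid guessers get 4, optimal time users 3, slow respondents 2
score4 <- ifelse(correct_times < lower, 4,
                 ifelse(correct_times > upper, 2, 3))

score3  # 2 3 3 3 3 2
score4  # 4 3 3 3 3 2
```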

Now, let’s see how the polytomous scoring approach works in R.

Example

To illustrate the polytomous scoring approach, we use response data from a sample of 5,000 students who participated in a hypothetical assessment with 40 items. In the response data, correct responses are coded as 1, incorrect responses as 0, not-answered items² as 8, and not-reached items as 9.

The data also include students’ response time (in seconds) for each item. The data as a comma-separated-values file (dichotomous_data.csv) is available here.

Now let’s import the data into R and then preview its content.

data <- read.csv("dichotomous_data.csv", header = TRUE)

# Item responses
head(data[,1:40])

# Response times
head(data[,41:80])

Next, we create a scoring function to transform dichotomous responses into polytomous responses based on the polytomous scoring approach described above. The polyscore function requires the following arguments:

  • response: a vector of dichotomous item responses
  • time: a vector of response times for the same item
  • max.score: the maximum polytomous score (3 or 4)
  • na.handle: how to treat not-reached items ("NA" for missing, "IN" for incorrect)
  • not.reached: the code used for not-reached items in the data
  • not.answered: the code used for not-answered items in the data
  • correct and incorrect: the lower and upper NT multipliers applied to the median response times of correct and incorrect responses

polyscore <- function(response, time, max.score, na.handle = "NA",
                      not.reached = NULL, not.answered = NULL,
                      correct = c(0.25, 1.75), incorrect = c(0.25, 1.75)) {
  
  # Find response time thresholds (NT approach) for correct and incorrect responses
  median.time.correct1 <- median(time[which(response == 1)], na.rm = TRUE) * correct[1]
  median.time.correct2 <- median(time[which(response == 1)], na.rm = TRUE) * correct[2]
  median.time.incorrect1 <- median(time[which(response == 0)], na.rm = TRUE) * incorrect[1]
  median.time.incorrect2 <- median(time[which(response == 0)], na.rm = TRUE) * incorrect[2]
  
  # Recode dichotomous responses as polytomous
  if(max.score == 3) {
    # Correct responses: optimal time users get 3; rapid and slow respondents get 2
    response <- ifelse(response == 1 & time <= median.time.correct1, 2,
                       ifelse(response == 1 & time >= median.time.correct2, 2,
                              ifelse(response == 1, 3, response)))
    # Incorrect responses: optimal time users get 1; rapid and slow respondents get 0
    response <- ifelse(response == 0 & time <= median.time.incorrect1, 0,
                       ifelse(response == 0 & time >= median.time.incorrect2, 0,
                              ifelse(response == 0, 1, response)))
  } else if (max.score == 4) {
    # Correct responses: rapid guessers get 4, optimal time users 3, slow respondents 2
    response <- ifelse(response == 1 & time <= median.time.correct1, 4,
                       ifelse(response == 1 & time >= median.time.correct2, 2,
                              ifelse(response == 1, 3, response)))
    # Incorrect responses: optimal time users get 1; rapid and slow respondents get 0
    response <- ifelse(response == 0 & time <= median.time.incorrect1, 0,
                       ifelse(response == 0 & time >= median.time.incorrect2, 0,
                              ifelse(response == 0, 1, response)))
  }
  
  # Recode not-answered responses as incorrect
  if(!is.null(not.answered)) {
    response <- ifelse(response == not.answered, 0, response)
  }
  
  # Recode not-reached responses as incorrect ("IN") or missing ("NA")
  if(!is.null(not.reached)) {
    if(na.handle == "IN") {
      response <- ifelse(response == not.reached, 0, response)
    } else {
      response <- ifelse(response == not.reached, NA, response)
    }
  }
  
  return(response)
}
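The threshold computation at the top of polyscore can also be checked in isolation. Here is a self-contained sketch with simulated response times (the 5 × 2 matrix below is made up for illustration):

```r
set.seed(123)

# Simulated response times (seconds): 5 students x 4 items
rt <- matrix(round(runif(20, min = 5, max = 60)), nrow = 5)

# Lower (25% of median) and upper (175% of median) cut-offs for each item
thresholds <- apply(rt, 2, function(x) median(x) * c(0.25, 1.75))
rownames(thresholds) <- c("lower", "upper")

thresholds  # a 2 x 4 matrix of per-item cut-offs
```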

Before applying the polyscore function to the full data set, let’s see how rapid guessers, slow respondents, and optimal time users are identified using one of the items (item 1).

library("patchwork")
library("ggplot2")

# Response time distribution for correct
p1 <- ggplot(data = data[data$item_1==1, ], 
             aes(x = rt_1)) +
  geom_histogram(color = "white", 
                 fill = "steelblue", 
                 bins = 40) +
  geom_vline(xintercept = median(data[data$item_1==1, "rt_1"])*0.25, 
             linetype="dashed", color = "red", size = 1) +
  geom_vline(xintercept = median(data[data$item_1==1, "rt_1"])*1.75, 
             linetype="dashed", color = "red", size = 1) +
  labs(x = "Response Time for Item 1 (Correct)") +
  theme_bw()

# Response time distribution for incorrect
p2 <- ggplot(data = data[data$item_1==0, ], 
             aes(x = rt_1)) +
  geom_histogram(color = "white", 
                 fill = "steelblue", 
                 bins = 40) +
  geom_vline(xintercept = median(data[data$item_1==0, "rt_1"])*0.25, 
             linetype="dashed", color = "red", size = 1) +
  geom_vline(xintercept = median(data[data$item_1==0, "rt_1"])*1.75, 
             linetype="dashed", color = "red", size = 1) +
  labs(x = "Response Time for Item 1 (Incorrect)") +
  theme_bw()

(p1 / p2)

Now we can go ahead and implement polytomous scoring with our data. First, we separate the response and response time portions of the data.

resp_data <- data[, 1:40]

time_data <- data[, 41:80]

Next, we apply the polytomous scoring approach using different combinations of max.score and na.handle. We apply the polyscore function to each item using a loop.

polydata3_NA <- matrix(NA, nrow = nrow(resp_data), ncol = ncol(resp_data))
polydata3_IN <- matrix(NA, nrow = nrow(resp_data), ncol = ncol(resp_data))
polydata4_NA <- matrix(NA, nrow = nrow(resp_data), ncol = ncol(resp_data))
polydata4_IN <- matrix(NA, nrow = nrow(resp_data), ncol = ncol(resp_data))

for(i in 1:ncol(resp_data)) {
  
  polydata3_NA[,i] <- polyscore(resp_data[,i], time_data[,i], max.score = 3, 
                                na.handle = "NA", not.reached = 9, not.answered = 8,  
                                correct = c(0.25, 1.75), incorrect = c(0.25, 1.75))
  
  polydata3_IN[,i] <- polyscore(resp_data[,i], time_data[,i], max.score = 3, 
                                na.handle = "IN", not.reached = 9, not.answered = 8,  
                                correct = c(0.25, 1.75), incorrect = c(0.25, 1.75))
  
  polydata4_NA[,i] <- polyscore(resp_data[,i], time_data[,i], max.score = 4, 
                                na.handle = "NA", not.reached = 9, not.answered = 8,  
                                correct = c(0.25, 1.75), incorrect = c(0.25, 1.75))
  
  polydata4_IN[,i] <- polyscore(resp_data[,i], time_data[,i], max.score = 4, 
                                na.handle = "IN", not.reached = 9, not.answered = 8,  
                                correct = c(0.25, 1.75), incorrect = c(0.25, 1.75))
} 

Let’s quickly see what one of the recoded data sets looks like after applying polytomous scoring.

polydata3_NA <- as.data.frame(polydata3_NA)

head(polydata3_NA)

We will demonstrate how to estimate item and person parameters with an item response theory (IRT) approach. Note, however, that this scoring approach is flexible enough to be used with classical test theory (CTT) as well. When applied in the CTT framework, a high total score indicates that the student’s combined level of ability and engagement was high on the assessment. With this approach, speededness is no longer a nuisance variable but is used in the operationalization of ability (Tijmstra and Bolsinova 2018).
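In a CTT framework, for example, the polytomous scores can simply be summed into total scores. A minimal sketch, using a made-up 2 × 3 polytomous score matrix where NA marks a not-reached item:

```r
# Made-up polytomous scores for 2 students on 3 items (NA = not reached)
poly <- matrix(c(3, 2, NA,
                 1, 0,  3), nrow = 2, byrow = TRUE)

# CTT-style total scores, ignoring not-reached items
total <- rowSums(poly, na.rm = TRUE)
total  # 5 4
```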

We will import the necessary package for IRT (mirt) and define a unidimensional model for the 40 items.

library("mirt")

model <- 'F = 1-40' 

### Polytomous: GRM
results.grm <- mirt(data=polydata3_NA, model=model, itemtype="graded", SE=TRUE, verbose=FALSE)

### Item parameters 
coef.grm <- coef(results.grm, IRTpars=TRUE, simplify=TRUE)
items.grm <- as.data.frame(coef.grm$items)

### Person parameters
theta.grm <- matrix(fscores(results.grm, method='EAP'))

Finally, let’s examine the test information function and the standard error of measurement based on the estimated model; in mirt, these can be visualized together with plot(results.grm, type = "infoSE").

Conclusion

In this post, we introduced a simple polytomous scoring approach that uses item response times to handle not-reached items in low-stakes assessments. By rewarding optimal time use and penalizing rapid guessing and unrealistically slow responding, the approach incorporates engagement into the scoring process without requiring a complex joint model of responses and response times. The polyscore function presented above can be adapted to different scoring ranges, threshold percentages, and treatments of not-reached items.
References

Debeer, Dries, Rianne Janssen, and Paul De Boeck. 2017. “Modeling Skipped and Not-Reached Items Using IRTrees.” Journal of Educational Measurement 54 (3): 333–63.
Gorgun, Guher, and Okan Bulut. 2021. “A Polytomous Scoring Approach to Handle Not-Reached Items in Low-Stakes Assessments.” Educational and Psychological Measurement. https://doi.org/10.1177/0013164421991211.
Kuhfeld, Megan, and James Soland. 2020. “Using Assessment Metadata to Quantify the Impact of Test Disengagement on Estimates of Educational Effectiveness.” Journal of Research on Educational Effectiveness 13 (1): 147–75.
Lu, Ying, and Stephen G. Sireci. 2007. “Validity Issues in Test Speededness.” Educational Measurement: Issues and Practice 26 (4): 29–37. https://doi.org/10.1111/j.1745-3992.2007.00106.x.
Pohl, Steffi, Linda Gräfe, and Norman Rose. 2014. “Dealing with Omitted and Not-Reached Items in Competence Tests: Evaluating Approaches Accounting for Missing Responses in Item Response Theory Models.” Educational and Psychological Measurement 74 (3): 423–52.
Pohl, Steffi, Esther Ulitzsch, and Matthias von Davier. 2019. “Using Response Times to Model Not-Reached Items Due to Time Limits.” Psychometrika 84 (3): 892–920.
Rios, Joseph A, Hongwen Guo, Liyang Mao, and Ou Lydia Liu. 2017. “Evaluating the Impact of Careless Responding on Aggregated-Scores: To Filter Unmotivated Examinees or Not?” International Journal of Testing 17 (1): 74–104.
Tijmstra, Jesper, and Maria Bolsinova. 2018. “On the Importance of the Speed-Ability Trade-Off When Dealing with Not Reached Items.” Frontiers in Psychology 9: 964.
Wise, Steven L., and Xiaojing Kong. 2005. “Response Time Effort: A New Measure of Examinee Motivation in Computer-Based Tests.” Applied Measurement in Education 18 (2): 163–83.

  1. Smaller percentages can be used to obtain more conservative thresholds.↩︎

  2. These are items that students view but skip without selecting a valid response option.↩︎
